INTRODUCTION


During the 3480 Early Ship Program ( ESP ) in late 1984, it was decided
that selected accounts would send, on a weekly basis, a LOGREC tape
containing all 3480-related information for that account.


These LOGREC tapes were generated by dumping the system SYS1.LOGREC data
set and then sending it through IBTS to the Valencia Plant.


Major objectives at the time were:


1. To gain better knowledge of Machine and Media behaviour in a real
Customer environment.


2. To provide a close follow-up of Subsystem performance by Engineering
functions, assisting on-site CE's, when required, with a fast response
in problem resolution.


The strategy proved to be extremely useful, as measured by the level of
satisfaction achieved among ESP Customers and the number of problems
detected, and thus corrected, on GA machines.


LOGREC data received in Valencia was loaded on individual data bases
( one per Customer ) and then processed using statistical analysis
programs ( SAS ) common to both Tucson and Valencia. Additional
"one-shot" programs were developed to solve specific problems.


ESP is over, but the need to improve reliability figures remains, in
order to keep the 3480 as the leading Tape Unit in the Market in terms
of reliability and serviceability.


PROGRAM OBJECTIVES 


The Reliability Assurance program for the 3480 Tape Subsystem started in
Valencia in September 1985. Led by the Valencia Project Office, EMEA
Countries were requested to select major accounts and send LOGREC data
to Valencia on a regular basis.


Most Customers selected were RELIABILITY PLUS Subscribers, which in turn
added additional benefits, as described hereafter.


The major objectives of the Reliability Assurance program are:


1. Identify major detractors to higher reliability figures. 


2. Evaluate overall Media performance. 


3. Define Subsystem Reliability expectations after extended periods of
usage.


4. Improve Reliability figures, as published by R PLUS, in terms of
USE/HARD FAIL.


5. Identify non device-related failures being flagged as Hard Fails and,
subsequently, propose modifications to R PLUS algorithms.


6. Correlate 3480 reliability, as measured by R PLUS, with expected IBM
RAS criteria specifications.


7. Assist CE's in the resolution of specific problems.


OBR & MDR RECORDS 


As with other equipment, OBR and MDR records contain all information
related to device errors and statistical data. These records are stored
in the SYS1.LOGREC data set, usually on DASD, and constitute the primary
source of data for EREP and other analysis programs ( R PLUS ).


1. OBR ( Outboard Record ) contents:


   o  32 bytes of sense data for failure isolation ( FSC, Drive error
      code, etc ). These 32 bytes are presented by the Control Unit
      to the Channel each time a permanent error occurs ( not
      recovered by the CU ) or when the CU is in forced logging mode.


   o  Date, time, jobid, volser, CCW, OBRSW, etc, which are added by
      the system itself.


2. MDR ( Miscellaneous Data Record ) contents:


   o  32 bytes of environmental data containing the number of WTE's,
      RTE's, CWTE's, CRTE's, ERG's, RBLKS, WBLKS, RMBYTE, WMBYTE,
      RBLKCORR and WBLKCORR for each JOBID. They are presented to the
      Channel each time an overflow condition exists ( ERA=2A ) or a
      "Rewind/Unload" command ( ERA=2B ) is issued by the Channel.


   o  Date, Time, Volser, CPUSER, etc.


Following is a detailed description of 3480 OBR/MDR records. 


MDR RECORD TYPE FOR THE 3480 MAGNETIC TAPE SUBSYSTEM


@1   CLASRC    /* RECORD TYPE - MDR IS 90X AND 91X */
@2   SYSREL    /* SYSTEM RELEASE LEVEL */
@3   SWITCH0   /* RECORD SWITCH */
@4   SWITCH1   /* RECORD SWITCH */
@5   SWITCH2   /* RECORD SWITCH. MDR FOR 3480 = 41X */
@6   SWITCH3   /* RECORD SWITCH */
@7   MRCDCNT   /* SEQ. NO. AND PHYSICAL RECORDS CNT */
@9   DATE      /* DATE */
@13  TIME      /* TIME */
@14  HRS       /* TIME IN HOURS */
@15  MIN       /* TIME IN MINUTES WITHIN THE HOUR */
@16  SEC       /* TIME IN SECONDS WITHIN THE MINUTE */
@17  VERNO     /* MACHINE VERSION CODE */
@18  CPUSER    /* CPU SERIAL NUMBER */
@21  CPUMOD    /* CPU MODEL NUMBER ( 0158, 3081, ETC ) */
@23  SPACE1    /* RESERVED */
@25  BUFRECID  /* DEVICE ADDRESS / DEVICE NUMBER */
@27  VOLSER    /* VOLUME SERIAL */
@33  SPACE2    /* RESERVED */
@37  BLKLEN    /* BLOCK LENGTH */
@39  MDR00     /* UNIT CHECK */
@40  MDR01     /* DEVICE STATUS */
@41  MDR02     /* DATA PATH & ERROR POSITIONING */
@42  MDR03     /* ERP ACTION CODE = ERA ( 2A OR 2B ) */
@43  MDR04     /* BLOCK ID */
@46  MDR07     /* FORMAT 20 = SENSE. 21 = BFR LOG */
@47  MDR08     /* READ FWD DATA CHECKS */
@48  MDR09     /* READ BKWD DATA CHECKS */
@49  MDR10     /* WRITE DATA CHECKS */
@50  MDR11     /* READ BLOCKS CORRECTED ON FLY (WRT ECC) */
@51  MDR12     /* WRITE BLOCKS CORRECTED ON FLY (RD ECC) */
@52  MDR13     /* CU EQUIPMENT CHECKS */
@53  MDR14     /* READ BYTES PROCESSED X 4096 */
@55  MDR16     /* WRITE BYTES PROCESSED X 4096 */
@57  MDR18     /* READ BLOCKS PROCESSED X 256 */
@58  MDR19     /* WRITE BLOCKS PROCESSED X 256 */
@59  MDR20     /* TRANSIENT DATA CHECKS (ISV) */
@60  MDR21     /* RESERVED */
@61  MDR22     /* CRITERIA WRITE TEMP. ERRORS */
@62  MDR23     /* CRITERIA READ TEMP. ERRORS */
@63  MDR24     /* ERASE GAP COUNTS */
@64  MDR25     /* DRIVE EQUIPMENT CHECKS */
@65  MDR26     /* LOW ORDER POSITION OF RD/WR COUNTER */
@66  MDR27     /* IML/HRDWR EC LEVEL/SERIAL NO. */
@69  MDR30     /* READ ERROR RETRIES */
@70  MDR31     /* BUFFER SEGMENT DEMARKED */
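

The MDR layout can be turned into a small slicing helper. The following
is a minimal sketch only. It assumes that each "@n" is a 1-based byte
position and that a field runs up to the byte before the next listed
offset; the name slice_mdr and the selection of fields are illustrative,
and no decoding of the raw bytes ( EBCDIC, binary counters ) is
attempted.

```python
# Hypothetical helper: cut a raw MDR record into named byte fields.
# Assumption: "@n" is a 1-based byte position; lengths are inferred
# from the distance to the next offset in the layout.
MDR_FIELDS = {
    "VOLSER": (27, 6),  # volume serial, @27 to @32
    "MDR03": (42, 1),   # ERA ( 2A or 2B )
    "MDR08": (47, 1),   # read forward data checks
    "MDR09": (48, 1),   # read backward data checks
    "MDR10": (49, 1),   # write data checks
    "MDR13": (52, 1),   # CU equipment checks
    "MDR14": (53, 2),   # read bytes processed x 4096
    "MDR16": (55, 2),   # write bytes processed x 4096
    "MDR25": (64, 1),   # drive equipment checks
}


def slice_mdr(record: bytes) -> dict:
    """Return the raw bytes of each listed field, keyed by field name."""
    if len(record) < 70:
        raise ValueError("MDR record shorter than 70 bytes")
    return {
        name: record[start - 1:start - 1 + length]
        for name, (start, length) in MDR_FIELDS.items()
    }
```

The same approach would apply to the OBR layout, with the sense bytes
OBR00 to OBR31 starting at @81 instead of @39.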


OBR RECORD FORMAT FOR THE 3480 MAGNETIC TAPE SUBSYSTEM


@1   CLASRC    /* RECORD TYPE - OBR IS 30X */
@2   SYSREL    /* SYSTEM RELEASE LEVEL */
@3   SWITCH0   /* RECORD SWITCH */
@4   SWITCH1   /* RECORD SWITCH */
@5   SWITCH2   /* RECORD SWITCH */
@6   SWITCH3   /* RECORD SWITCH */
@9   DATE      /* DATE */
@13  TIME      /* TIME */
@14  HRS       /* TIME IN HOURS */
@15  MIN       /* TIME IN MINUTES WITHIN THE HOUR */
@16  SEC       /* TIME IN SECONDS WITHIN THE MINUTE */
@17  VERNO     /* MACHINE VERSION CODE */
@18  CPUSER    /* CPU SERIAL NUMBER */
@21  CPUMOD    /* CPU MODEL NUMBER ( 0158, 3081, ETC ) */
@23  SPACE1    /* RESERVED */
@25  JOBID     /* JOBID */
@50  SECUA     /* SECONDARY CHANNEL AND UNIT ADDRESS */
@53  DEVTYPE   /* DEVICE TYPE 8080X FOR 3480 */
@55  DEPTYPEA  /* DEVICE TYPE (RIGHTHAND ADJUST) */
@57  SDRCNT    /* NUMBER OF BYTES IN SDR AREA */
@58  PCUA      /* PRIMARY CUA OF DEVICE */
@61  IORETRY   /* NUMBER OF I/O RETRIES */
@63  SENSCNT   /* NUMBER OF BYTES IN SENSE FIELD */
@65  VOLSER    /* VOLUME SERIAL */
@71  BLKLEN    /* BLOCK LENGTH */
@73  HDRSER    /* HEADER LABEL SERIAL */
@76  SPACE4    /* RESERVED */
@81  OBR00     /* UNIT CHECK */
@82  OBR01     /* DEVICE STATUS */
@83  OBR02     /* DATA PATH & ERROR POSITION */
@84  OBR03     /* ERA CODE */
@85  OBR04     /* BLKID */
@88  OBR07     /* FORMAT 20=SENSE 21=BUFF LOG */
@89  OBR08     /* DRIVE ERP CODE */
@90  OBR09     /* CU FLAGS */
@91  OBR10     /* CU FRU 1, 1ST ERROR CODE */
@93  OBR12     /* CU FRU 2, 2ND ERROR CODE */
@95  OBR14     /* CU FRU LAST ERROR */
@97  OBR16     /* CU HRDWR FRU CODES */
@99  OBR18     /* DRIVE FLAGS BYTE 1 */
@100 OBR19     /* DRIVE FLAGS BYTE 2 */
@101 OBR20     /* DRIVE FRU CODE 1 */
@103 OBR22     /* DRIVE FRU CODE 2 */
@105 OBR24     /* CU CHANNEL INTERFACES */
@106 OBR25     /* CU FEATURES */
@107 OBR26     /* CU UCODE EC LEVEL */
@108 OBR27     /* IML/HRDWR EC LEVEL/SERIAL NO. */
@111 OBR30     /* DRIVE ADDRESS LOGICAL/PHYSICAL */
@112 OBR31     /* DATA BYTE COUNT IN BUFFER (CDR) */


3480 ERROR RECOVERY CATEGORIES


Any 3480 error will be categorized as follows:


1. "On the fly" errors: Errors corrected by internal Subsystem hardware
without microcode involvement. There is neither a real-time data check
indication nor performance degradation.

   o  The MDR record ( MDR11, MDR12 ) will be updated by CU microcode.
   o  The error won't be logged as a Soft fail.
   o  It will be reported in EREP as an ECC error.


2. Error recovery successful by the Subsystem ( Temporary error ).
Statistical records within the buffered log ( MDR's ) will be updated.
An OBR will be logged in SYS1.LOGREC if the Subsystem is in forced
logging mode. These errors will be logged as SF's or MSF's by R PLUS
algorithms.


3. Error retry unsuccessful, or error not able to be retried by the
Subsystem. Actions will be:

   A. Log information pertinent to the error on SYS1.LOGREC.

   B. Perform recovery actions as called for by the CPU.

If CPU recovery actions are successful, the OBR will be logged as
"Recovered error" on EREP and won't be a candidate for Hard Fail, since
OBR word 4 ( SWITCH1 ) will indicate a temporary error, as will be
discussed later on.
If CPU recovery actions are unsuccessful, the OBR will be logged in one
of the Hard Fail categories.


4. Catastrophic errors: Errors or conditions which cause a loss of
communications with the Host system: Microprocessor errors, some Channel
Adapter errors, power failures, etc.


This type of error will always be logged as a Hard Fail.


RELIABILITY PLUS BASIC DEFINITIONS 


The parameters used by RELIABILITY PLUS Inc. to measure 3480 performance
are described below:


A. HARD FAIL ( HF ): Any device permanent error preventing the
completion of the Customer's job as originally scheduled. Note that a
Hard Fail does not necessarily mean a JOB ABEND.


Example: A Write permanent error on one drive, where a DDR is called and
is successful.


Although many permanent errors can be associated with a unique
JOBID-VOLSER-CUA combination, only the first permanent error will be
deemed to be a Hard Fail candidate. Subsequent permanent errors will be
considered OTHER HARD FAILS.


B. MEDIA HARD FAIL ( MHF ): Any permanent error caused by the same VOLID
on different drives. The first occurrence is classified as MHF;
subsequent occurrences are classified as "REPEAT MEDIA HARD FAIL"
( RMHF ).


C. SOFT FAIL ( SF ): Any temporary error attributed to hardware
failures.


D. MEDIA SOFT FAIL ( MSF ): Temporary errors attributed to Media
failures.


Thus, the total number of temporary errors ( TE's ) is:


TE's = SF's + MSF's


RELIABILITY PLUS ALGORITHMS OVERVIEW 


All OBR records logged on SYS1.LOGREC will be analyzed and subsequently
classified into one of the four Hard-Fail categories:


   o  MEDIA ( MHF )
   o  REPEAT MEDIA ( RMHF )
   o  OTHER ( OHF )
   o  DEVICE ( HF )


OTHER is a category used to classify permanent OBR records that are
not analyzed to be either MEDIA or DEVICE hard fails.


The Algorithm has several steps:


1. Discard all OBR records where SWITCH1, bit 1 is set. This eliminates
the "temporary OBR's" logged when the Subsystem is in forced error
logging mode, as well as the "recovered errors" ( Subsystem permanent
errors which are temporary from the CPU standpoint ). This explains why
a Subsystem being in forced logging mode ( temporary OBR's logged )
makes no difference to R PLUS algorithms: the temporary OBR's are simply
discarded.


2. OBR records with the following ERA codes ( content of OBR field
OBR03 ) will be directly assigned to OTHER hard fails:

   o  ERA 21 ( DATA STREAMING ERROR )
   o  ERA 26 ( READ BACKWARD DATA CHECK )
   o  ERA 36 ( DRIVE PATCH LOAD FAILURE )
   o  ERA 39 ( BACKWARD AT BOT )
   o  ERA 3B ( VOLUME REMOVED BY OPERATOR )
   o  ERA 40 ( OVERRUN )
   o  ERA 4B ( CONTROL UNIT AND DRIVE UNIT INCOMPATIBLE )


3. OBR's with sense byte 0 ( OBR00 ) = 40 ( Intervention required ) or
sense byte 0 ( OBR00 ) = 80 ( Command reject ) will be directly assigned
to OTHER hard fails.


4. The following Subsystem Fault Symptom Codes ( FSC's ) will be
directly assigned to OTHER hard fails:

   o  FSC 70C2 ( Block Id mismatch ) and ERA = 41
   o  FSC 7161 ( Blank Tape ) and ERA = 2E
   o  FSC 7153 ( Tape Void ) and ERA = 31
   o  FSC CF80 ( Serial Itfce Bus disabled ) and ERA = 42
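

Steps 1 to 4 amount to a pre-filter applied to every OBR record, which
can be sketched as follows. This is an illustration under stated
assumptions, not the R PLUS code: the function name and arguments are
hypothetical, the inputs are assumed to be already-decoded integers, and
"bit 1" of SWITCH1 is assumed to follow the IBM convention in which
bit 0 is the most significant bit ( mask X'40' ).

```python
# Assumption: "SWITCH1, bit 1" uses IBM bit numbering (bit 0 = MSB).
TEMPORARY_MASK = 0x40

# Step 2: ERA codes assigned directly to OTHER hard fails.
OTHER_ERAS = {0x21, 0x26, 0x36, 0x39, 0x3B, 0x40, 0x4B}

# Step 3: sense byte 0 values assigned directly to OTHER hard fails.
OTHER_SENSE0 = {0x40, 0x80}

# Step 4: (FSC, ERA) pairs assigned directly to OTHER hard fails.
OTHER_FSC_ERA = {
    (0x70C2, 0x41),  # Block Id mismatch
    (0x7161, 0x2E),  # Blank Tape
    (0x7153, 0x31),  # Tape Void
    (0xCF80, 0x42),  # Serial Itfce Bus disabled
}


def prefilter(switch1, era, sense0, fsc):
    """Return 'DISCARD', 'OTHER', or None if still a HF/MHF candidate."""
    if switch1 & TEMPORARY_MASK:
        return "DISCARD"  # step 1: temporary or recovered OBR
    if era in OTHER_ERAS:
        return "OTHER"    # step 2
    if sense0 in OTHER_SENSE0:
        return "OTHER"    # step 3
    if (fsc, era) in OTHER_FSC_ERA:
        return "OTHER"    # step 4
    return None
```

Records for which the function returns None would go on to the JOBID
handling and the MEDIA / DEVICE classification of the later steps.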


5. There are two special JOBID's which are handled in a different way by
R PLUS: "EOS EXIT" and "*MASTER*".


R PLUS does not necessarily discard all permanent OBR's with EOS EXIT or
*MASTER* as JOBID. It simply does not consider the appearance of either
of those JOBID's as a change of JOBID. This becomes important when one
considers that a change in either the JOBID, VOLSER or Drive address is
a reason to count another HF.


For example:


   JOBID       VOLSER    CUA

   ABCDEF      112233    FFF
   *MASTER*    112233    FFF
   ABCDEF      112233    FFF


might be expected to count as 3 HF's because of the changes in JOBID. In
fact, R PLUS counts the error as one, because *MASTER* is not considered
a change in the JOBID.


If the permanent OBR is logged with a unique combination of VOLSER,
JOBID and CUA, *MASTER* being the JOBID, R PLUS will count it as a HF.


Experience shows that all OBR's logged with EOS EXIT as the JOBID are
always temporary OBR's, i.e., OBR SWITCH1 bit 1 is always set.
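

The JOBID rule can be sketched as follows. This is an illustration only,
not the R PLUS implementation: the function name, the tuple layout and
the assumption that the permanent OBR's arrive already in sequence are
all hypothetical.

```python
# JOBID's that R PLUS does not treat as a change of JOBID.
SPECIAL_JOBIDS = {"*MASTER*", "EOS EXIT"}


def count_hf_candidates(records):
    """Count HF candidates in a sequence of (jobid, volser, cua) tuples.

    A new candidate is counted whenever JOBID, VOLSER or CUA changes,
    except that a special JOBID inherits the JOBID of the record
    before it.
    """
    count = 0
    last_key = None
    last_jobid = None
    for jobid, volser, cua in records:
        if jobid in SPECIAL_JOBIDS and last_jobid is not None:
            jobid = last_jobid  # not considered a change of JOBID
        key = (jobid, volser, cua)
        if key != last_key:
            count += 1
        last_key = key
        last_jobid = jobid
    return count
```

With the three records of the example the function returns 1, and a
single record with *MASTER* as JOBID still counts as one candidate.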


6. All permanent errors not filtered out in the above steps will be
candidates for Media Hard Fails ( MHF's ). Their sort sequence is:


*** VOLSER, CUA, DATE, TIME ***


Any of these OBR records containing one VOLSER spanning at least two
CUA's within 15 days reflects a potential Media problem; hence, the
first occurrence of this sequence is classified as a MEDIA hard fail
( MHF ), and the remaining ones become REPEAT MEDIA hard fails ( RMHF ).


Multiple permanent OBR records with an ERA of 23 or 25, with any
combination of unique CUA/VOLSER, occurring within one hour also reflect
a potential MEDIA problem; hence, the first occurrence of this sequence
is classified as a MEDIA hard fail, and the remaining ones become REPEAT
MEDIA hard fails. This handles a swap-to-the-same-drive situation.
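

The first rule of step 6 ( one VOLSER on at least two CUA's within 15
days ) can be sketched as follows. This is a simplified illustration
with hypothetical names: once a VOLSER qualifies, all of its records are
labelled, the earliest as MHF and the others as RMHF, and the one-hour
rule for ERA 23 / 25 is not covered.

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(days=15)


def classify_media(records):
    """records: list of (volser, cua, timestamp) for candidate OBR's.

    Returns a parallel list of 'MHF', 'RMHF' or None.
    """
    by_volser = defaultdict(list)
    for idx, (volser, cua, ts) in enumerate(records):
        by_volser[volser].append((ts, cua, idx))

    labels = [None] * len(records)
    for items in by_volser.values():
        items.sort()  # chronological order within the VOLSER
        spans_two_cuas = any(
            a[1] != b[1] and b[0] - a[0] <= WINDOW
            for i, a in enumerate(items)
            for b in items[i + 1:]
        )
        if spans_two_cuas:
            for n, (_, _, idx) in enumerate(items):
                labels[idx] = "MHF" if n == 0 else "RMHF"
    return labels
```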


7. Permanent OBR records not classified as MEDIA hard fails in step 6
will be candidates for Device ( HF's ) or OTHER hard fails. Their sort
sequence is:


*** DATE, TIME, CUA, VOLSER, JOBID ***


8. Those permanent OBR records not isolated in the above steps are OTHER
hard fails ( OHF ).


Once the analysis of all OBR records logged in SYS1.LOGREC is completed,
the next step is to analyze the MDR records. All information related to
the number of temporary errors, usage in Mbytes, etc, is obtained from
these records.


MDR records contain counts of temporary errors, i.e., Read, Write and
equipment checks. These records also contain usage information in the
form of Read/Write bytes processed, number of Read/Write blocks
processed and number of Read/Write blocks corrected ( ECC errors ).


Temporary errors are the sum of control unit temporary equipment checks
( MDR13 contents ), drive temporary equipment checks ( MDR25 contents )
and Read and Write temporary errors ( MDR08 + MDR09 + MDR10 ).


Thus:


TE's = CUTEC + DUTEC + WTE + RTE = SF + MSF


R PLUS algorithms will first identify those temporary errors which are
MEDIA related ( MSF ); the remaining temporary errors will be "Drive
soft fails" or, simply, Soft fails ( SF ).


SF = TE - MSF 
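

These relations can be written out as follows. This is a sketch only:
the function names are illustrative, the arguments are assumed to be
already-decoded integer counters from one MDR record, and reading RTE as
MDR08 + MDR09 ( read forward plus read backward data checks ) is an
interpretation of the record layout rather than a stated rule.

```python
def temporary_errors(mdr08, mdr09, mdr10, mdr13, mdr25):
    """TE = CUTEC + DUTEC + WTE + RTE for one MDR record."""
    rte = mdr08 + mdr09  # assumed: read fwd + read bkwd data checks
    wte = mdr10          # write data checks
    cutec = mdr13        # CU temporary equipment checks
    dutec = mdr25        # drive temporary equipment checks
    return cutec + dutec + wte + rte


def soft_fails(te, msf):
    """SF = TE - MSF: what remains once media soft fails are removed."""
    return te - msf
```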


The algorithm to "isolate" MEDIA soft fails ( MSF ) is as follows:


"MEDIA soft fails" are analyzed from daily drive activity, by Volume
serial. A VOLSER's temporary error count may be deemed to be a "Media
soft fail" if either of the following two conditions occurs:


(1). The share of the drive's temporary errors due to the VOLSER,
divided by the share of the drive's megabytes processed on the VOLSER,
exceeds 10:


          TE vol / TE drive
     ---------------------------   >  10
      MBYTES vol / MBYTES drive


(2). The daily drive activity includes at least two VOLSER's, and the
VOLSER accounts for more than 75 % of the total Drive's temporary
errors.


In either case, the VOLSER is flagged as containing "media soft fails"
( i.e., its temporary errors are not counted as drive Soft fails ). For
those VOLSER's that have no media soft fails, their temporary error
counts are the "drive soft fails".


VOLSER's meeting condition (1) are reported as "MEDIA SOFT FAILS BASED
ON CRITERIA 1"; likewise, VOLSER's detected by condition (2) are
reported as "MEDIA SOFT FAILS BASED ON CRITERIA 2".
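

The two criteria can be sketched for one drive and one day as follows.
This is a minimal illustration, not the R PLUS code: the function name
and the input dictionary are assumptions, and days with no temporary
errors or no megabytes processed are simply skipped.

```python
def media_soft_fails(day_activity):
    """Flag VOLSER's as media soft fails for one drive and one day.

    day_activity: {volser: (temporary_errors, mbytes_processed)}
    Returns {volser: set of criteria met (1 and/or 2)}.
    """
    te_drive = sum(te for te, _ in day_activity.values())
    mb_drive = sum(mb for _, mb in day_activity.values())
    flagged = {}
    if te_drive == 0 or mb_drive == 0:
        return flagged

    for volser, (te, mb) in day_activity.items():
        te_share = te / te_drive
        mb_share = mb / mb_drive
        criteria = set()
        # Criterion (1): error share over usage share exceeds 10.
        if mb_share > 0 and te_share / mb_share > 10:
            criteria.add(1)
        # Criterion (2): at least two VOLSER's, one with over 75 % of TE's.
        if len(day_activity) >= 2 and te_share > 0.75:
            criteria.add(2)
        if criteria:
            flagged[volser] = criteria
    return flagged
```

For instance, a volume with 40 of a drive's 50 temporary errors but only
100 of its 2000 Mbytes has a ratio of 0.8 / 0.05 = 16 and is flagged
under both criteria.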


RELIABILITY PLUS VERSUS EREP


The SYS1.LOGREC data set can be analyzed in many ways, and
interpretations of permanent OBR records vary. The figure below shows
how EREP and Reliability Plus each analyze SYS1.LOGREC data. MDR records
are not shown because bytes and blocks processed ( from MDR only ) can
be interpreted in only one way: by counting.


                 EREP reporting               Reliability Plus
       candidate    control      drive     candidate   candidate
       for media   unit perm      perm     for media      for
ERA    perm error    error       error     hard fail   hard fail

22                                 X           X           X
23         X                       X           X           X
25         X                       X           X           X
28                                 X           X           X
2C                                 X           X           X
2D                                 X           X           X
2E                                 X           X           X
31                                 X           X           X
32                                 X           X           X
35                                 X           X           X
36                                 X
39                                 X
3B         OPERATOR RELATED ERROR
40                     X
41                                 X           X           X
47                     X                       X           X
49                     X                       X           X
4A                                 X           X           X
4B                                 X
4C                     X                       X           X


Fig. EREP versus Reliability Plus analysis of permanent OBR records.
ERA codes are defined in the appendix.


The Error Recovery Action ( ERA ) is a one-byte code in OBR sense
byte 3. It is logged based on the results of the Subsystem microcode
Error Recovery Programs, which analyze the fault symptoms for which
recovery is attempted; the result of these actions is reflected as the
ERA. Only those ERA's anticipated to be used by R PLUS are listed in the
appendix. Refer to GA32-0042-0 "IBM 3480 Magnetic Tape Subsystem
Reference: Error Recovery Procedures" for further information.


All permanent OBR records are analyzed and classified into four
permanent error types for reporting by the System Exception report:


VOLUME-RELATED


VOLUME-related permanent errors are based on an ERA of 23 or 25 with an
accompanying VOLSER. The sort sequence is:


*** DATE, VOLSER, CUA ***


The sort is done independently for each of the two ERA's. Any OBR
records containing one VOLSER spanning at least two CUA's on any given
day constitute a VOLUME-related permanent error.


OPERATOR


Permanent OBR records with an ERA of 3B are OPERATOR permanent errors.


CONTROL UNIT


Permanent OBR records with an ERA of 40, 47, 49 or 4C are CONTROL UNIT
permanent errors.


DRIVE


Permanent OBR records not classified in the above steps become DRIVE
permanent errors.
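

The EREP classification can be sketched as a single function. This is an
illustration with hypothetical names; in particular, the caller is
assumed to have already counted, for the VOLSER and ERA of the record,
the number of distinct CUA's with a permanent OBR on the same day.

```python
VOLUME_ERAS = {0x23, 0x25}
CONTROL_UNIT_ERAS = {0x40, 0x47, 0x49, 0x4C}
OPERATOR_ERA = 0x3B


def erep_category(era, cuas_same_day=1):
    """Classify one permanent OBR record into an EREP error type.

    cuas_same_day: distinct CUA's on which the same VOLSER logged a
    permanent OBR with this ERA on the same day (assumed precomputed).
    """
    if era in VOLUME_ERAS and cuas_same_day >= 2:
        return "VOLUME"
    if era == OPERATOR_ERA:
        return "OPERATOR"
    if era in CONTROL_UNIT_ERAS:
        return "CONTROL UNIT"
    return "DRIVE"
```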


LOGREC ANALYSIS METHODOLOGY


The current "flow of operations" for each LOGREC tape received in
Valencia is as follows:


1. Customer LOGREC tape received in Valencia Plant ( Mail, IBTS ).


2. Customer LOGREC tape loaded on data base and converted into a SAS
data set.


3. Monthly overall report generated for all Customers. This report
includes overall performance parameters ( R PLUS and RAS Read/Write
reliability figures ).


4. Feedback report sent to Country R PLUS coordinator/AFSG specialist.
Details of the report are explained hereafter.


5. Detailed Customer analysis by Product Eng. based on:

   o  Poor performance on monthly report.
   o  Valencia Project Office request.


6. Several types of Customer analysis based on:

   o  Poor Hard Fail ( HF ) performance ( Perm. errors )
   o  Poor Soft fail ( SF ) performance ( Temp. errors )


7. Hard Fail analysis includes:

   o  Identify OBR's causing HF's.
   o  Classify HF's per CUA.
   o  Classify HF's per ERA code.
   o  Classify HF's per VOLID.


7.1. MHF ( Media Hard fail ) analysis: Identify VOLID's flagged by R
PLUS Simulation program.


7.2. EREP analysis ( VOLPE's, DRPE's, CUPE's, OPPE's, OTHPE's ) versus R
PLUS HF's.


7.3. Action plan based on above information.


8. Soft fail analysis method depending on:

   o  Poor RAS performance on IBM terms ( KMB-CRTE, KMB-CWTE )
   o  Poor performance on R PLUS terms ( SF-ratio < 1 )


9. Analysis upon degraded RAS performance includes:

   o  RD/WR temp. error distribution per CUA and week.
   o  RD/WR temp. error distribution per CU ( Local-Remote )
   o  Load balancing efficiency.
   o  MDR records analysis per CUA exceeding a threshold criteria.
   o  Blocks corrected vs blocks processed ratio per CUA.
   o  Identify major detractors ( VOLID's ) to degraded performance.
   o  OBR ( FMT 19 ) records per CUA.
   o  Analysis based on Customer specific problems by designing
      "one-shot" programs.


9.1. Poor SF-ratio analysis includes:

   o  Evaluation of CU, DU temporary equipment checks by CUA.
   o  Evaluation of RD/WR temp. errors ( all ) based on RAS
      information. This info. is only available if the Subsystem is
      in forced logging mode.


10. Customer Tape Library maintenance ( requires at least 2 months of
history on data base ).


11. Media Soft fail ( MSF ) analysis as per R PLUS criteria, using the
R PLUS simulation program and NDD criteria.


DEFINITION OF PROGRAM PARAMETERS.


As stated before, after reception of each Customer LOGREC tape a report
is generated and sent to each Country R PLUS coordinator/AFSG
specialist. The main goals of this report are:


1. To provide the responsible persons with a general overview of product
performance, as per both R PLUS and IBM RAS criteria.


2. To identify those VOLSERS which require some action by the on-site CE
( remove them if the number of TE's is confirmed after an additional
test on that VOLSER using OLT C, etc ). A good library maintenance
policy is the key to a well-performing product.


3. To further assist AFSG/HCS teams in identifying those drives with a
poor performance trend, or errors which took too long to fix.


The feedback report is organized as follows:


A. HARDFAILS DISTRIBUTION BY CUA (& WEEK): This sheet provides the
distribution of Hard fails per Drive ( CUA ) and week. Each Drive
address ( CUA ) is associated, by the program, with a symbol ( one
letter ). This report highlights those Drives whose performance is poor
over time, or which performed poorly for too long before a corrective
action was taken.


B. HARDFAILS DISTRIBUTION BY ERA (& WEEK): As before, this report
indicates, by week, what type of Hard fail occurred ( each Hard fail
has an associated ERA code ).


C. PERMANENT ERRORS DISTRIBUTION BY CUA (& WEEK): Same as chart (A),
but ALL permanent errors, regardless of whether they are Hard fails or
not, are displayed per Drive ( CUA ) and week.


D. PERMANENT ERRORS DISTRIBUTION BY ERA: Same as chart (B), but for
permanent errors.


E. OBR (BY DATE, TIME) FOR HARD FAIL ANALYSIS (WITH SENSE): This report
provides the 32 bytes of sense data for each permanent OBR logged in
the reported period. The program further classifies each OBR as one of
the possible Hard fail categories ( HF, MHF, RMHF, CWHF, CRHF ) and as
per EREP-like parameters.


Note that:

CWHF  = Criteria Write Hard Fail = A Hard fail due to a Write perm. error
CRHF  = Criteria Read Hard Fail  = A Hard fail due to a Read perm. error
CWPE  = Criteria Write permanent error
CRPE  = Criteria Read permanent error
DRPE  = Drive permanent error
CUPE  = Control Unit permanent error
OPPE  = Operator permanent error
OTHPE = Other permanent error
VOLPE = Volume permanent error


The next chapter will explain in detail what a "criteria" permanent
error is.


F. MEDIA SOFT ANALYSIS OF MDR RECORDS (BY VOLSER, BY DATE).


This report provides a list of those VOLSERS exceeding a given number
of TE's ( currently the threshold is set to 30; previously it was 10 ).
Each of the VOLSERS exceeding the threshold is analyzed using R PLUS
Soft fail algorithms (1) and (2). An ACTION REQUIRED is posted if the
VOLSER is found to be a "Media Soft Fail".


Nomenclature is as follows:


VOLSER = Flagged VOLSER
DATE   = The Julian date on which the above VOLSER was flagged
CUA    = Drive address where VOLSER was flagged
DVOL   = Number of volumes used on above CUA on this DATE
DMBYTE = Total number of Mbytes processed on this CUA on this DATE
MBYTES = Total number of Mbytes processed on this specific VOLSER
RMBYTE = Read Mbytes on this VOLSER
WMBYTE = Write Mbytes on this VOLSER
DTE    = Total number of temporary errors on this CUA on this DATE
TE     = Number of temporary errors on this VOLSER
CUTEC  = Control Unit temporary equipment checks
DUTEC  = Drive Unit temporary equipment checks
RTE    = Number of read temporary errors on this VOLSER
WTE    = Number of write temporary errors on this VOLSER
CRTE   = Number of criteria read temporary errors on this VOLSER
CWTE   = Number of criteria write temporary errors on this VOLSER
MSF    = Number of Media soft fails
MSF1   = VOLSER flagged by criteria (1)
MSF2   = VOLSER flagged by criteria (2)


In addition, a new report is being issued, flagging those VOLSERS which
fail NDD criteria ( more than 18 write temporary errors on two different
drives ).


F. SOFTFAILS DISTRIBUTION BY CUA (& WEEK): Distribution of Soft fails
by Drive ( CUA ) and week. This chart is useful to detect performance
degradation, in terms of temporary errors, for specific Drive units.


G. RELIABILITY PLUS OVERVIEW (BY WEEK): This report provides all the
basic Performance figures for all 3480 Subsystems installed in the
account. The terminology should already be familiar to the reader.


H. RELIABILITY PLUS OVERVIEW (BY DRIVE): This report provides basic
performance parameters ( both R PLUS and IBM RAS ) for each Drive Unit
in the reported period.


I. SOFTFAIL OVERVIEW (BY MONTH): This report provides basic information
on Subsystem performance in terms of temporary errors, by month. Some
new terms used here are:


KBLKS = Total number of KBlocks processed
NVOLS = Total number of different volumes used
RECC  = Total number of read blocks corrected "on the fly"
WECC  = Total number of write blocks corrected "on the fly"
NAR   = Number of "ACTION REQUIRED", i.e., volumes flagged by MSF
        algorithms


J. HARDFAIL OVERVIEW (BY MONTH): Provides total amounts of permanent
errors, by month, classified within each category.


K. OVERALL OVERVIEW (BY MONTH): This report provides overall Subsystem
performance, both on R PLUS and IBM terms.


Additional reports, still under development, will allow easier tracking
of Drive Unit performance by HCS/AFSG teams. The program will also
detect those Drives with a poor performance trend not due to specific
volsers, and will flag them as "ACTION REQUIRED", similar to what is
currently done with Tapes.


3480 BASIC RELIABILITY FIGURES


There are different ways to define 3480 Reliability figures. The
following paragraphs discuss current experience with 3480 performance
expectations.


1. HARDWARE RELIABILITY: It can be defined in several ways.


A. Repair action rate ( RA ), i.e., number of RA's per machine per
month due to hardware failures. Usage is estimated at 700 power-on hours
per month. This figure is calculated taking into account the expected
failure rate for the technology used in the 3480. Current values are:


   RA's per A22 ( Control Unit ) per month: 0.110
   RA's per B22 ( Drive Unit ) per month:   0.191


Therefore, the number of "expected hardware failures" on a 2x8 subsystem
per month ( average figure, on a 3-month rolling basis ) should be:


   RA's TOTAL/MONTH = 2 x 0.110 + 8 x 0.191 = 1.748


B. From the Customer point of view, hardware reliability is expected to
be:


   *  250 GBYTE/HF, a HF being any device permanent error which
      prevents the Customer job's completion as originally scheduled

   OR

   *  65 GBYTE between permanent errors ( ANY permanent error ).
      This is the "MBYTES/PERMANENT ERROR" figure which appears in EREP.


C. From the temporary errors standpoint, field experience shows:


   *  1 GBYTE/SF, a SF being any DEVICE related temporary error

   OR

   *  500 MBytes or higher per temporary write error


For reference, the U.S. Field procedures to monitor 3480 performance
follow. It would be desirable to implement similar procedures in EMEA.


FIELD PROCEDURES


*  ACCOUNT CE

   -  Tracks performance against 65,000 Megabytes per permanent error.
   -  Alerts Branch specialist when account is below criteria for two
      consecutive weeks.


*  BRANCH SPECIALIST

   -  Alerts Region when account is below criteria for second week.


*  REGION PRODUCT COORDINATOR

   -  Notifies NSD HQ (FTO) when account is below criteria for third
      week.
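

The weekly tracking behind these procedures can be sketched as follows.
This is an illustration only: the function names and the input format
are hypothetical, and the mapping from the number of consecutive weeks
below criteria to the highest level notified is one reading of the
procedure above.

```python
CRITERION_MB_PER_PERM_ERROR = 65_000


def mb_per_permanent_error(mbytes, perm_errors):
    """Megabytes processed per permanent error for one week."""
    if perm_errors == 0:
        return float("inf")
    return mbytes / perm_errors


def weeks_below_criteria(weekly):
    """Length of the current run of weeks below the criterion.

    weekly: list of (mbytes, perm_errors), oldest week first.
    """
    run = 0
    for mbytes, perm_errors in reversed(weekly):
        figure = mb_per_permanent_error(mbytes, perm_errors)
        if figure >= CRITERION_MB_PER_PERM_ERROR:
            break
        run += 1
    return run


def highest_level_notified(run):
    """Assumed reading: 2 weeks reach Region, 3 weeks reach NSD HQ."""
    if run >= 3:
        return "NSD HQ (FTO)"
    if run >= 2:
        return "REGION"
    return None
```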


PROGRAM ACHIEVEMENTS.


The following Customers are currently taking part in the Valencia 3480
Reliability Assurance program. Most of them report on a regular basis,
usually monthly, and are R PLUS Customers. Some accounts report because
of very specific problems, usually over short periods of time, until
those problems are solved.


DENMARK

HANDELSBANKEN, SDC, DANSKE BANKEN, MULTIDATA, DATACENTRALEN


GERMANY

DATEV, HENKEL, HEW HAMBURG, HAMBURG MANHEIMER, QUELLE, BOSCH, HDV,
FIDUCIA, GAD


U.K.

WESTERN GEO, FRIENDS PROVIDENT, COMMERCIAL UNION, AMERICAN EXPRESS


NETHERLANDS

RABO BANK, RABO 2, KLM


FRANCE

ESSO, CREDIT LYONAIS, SDRM, AGF-GIE, AIR FRANCE


SWITZERLAND

ERZ BV, SWISSAIR, CIBA GEIGY, ERZ PTT, BEDAG


BELGIUM

BOERENBOND


SPAIN

EL CORTE INGLES


SWEDEN

SE BANK


SOUTH AFRICA

SOUTH AFRICAN MUTUAL


APPENDIX : ERA CODES


This appendix supplies the error codes used to analyze SYS1.LOGREC data.
The 3480 Design Control Document specifies at least 45 Error Recovery
Action ( ERA ) codes for the Subsystem microcode Error Recovery
Programs. They are logged in both permanent and temporary OBR error
records and in MDR records. The ERA's expected to be Hard Fail
candidates are:


ERA   DESCRIPTION

22    Path equipment check
23    Data check Read
25    Data check Write
28    Write ID mark check
2C    Permanent equipment check
2D    Data security erase fail
2E    Not Capable ( EOT error )
31    Tape void
32    Tape tension lost
35    Drive equipment check
36    Drive patch load failure
39    Backward at BOT
3B    Volume removed by operator
40    Overrun
41    Block ID sequencing
47    Control Unit error
49    Bus out Parity
4A    Control Unit ERP failed
4C    Control Unit error recovered ( Check 1 )